approximation architecture
Non-parametric Approximate Dynamic Programming via the Kernel Method
This paper presents a novel non-parametric approximate dynamic programming (ADP) algorithm that enjoys graceful approximation and sample complexity guarantees. In particular, we establish both theoretically and computationally that our proposal can serve as a viable alternative to state-of-the-art parametric ADP algorithms, freeing the designer from carefully specifying an approximation architecture. We accomplish this by developing a kernel-based mathematical program for ADP. Via a computational study on a controlled queueing network, we show that our procedure is competitive with parametric ADP approaches.
Unsupervised Basis Function Adaptation for Reinforcement Learning
When using reinforcement learning (RL) algorithms to evaluate a policy it is common, given a large state space, to introduce some form of approximation architecture for the value function (VF). The exact form of this architecture can have a significant effect on the accuracy of the VF estimate, however, and determining a suitable approximation architecture can often be a highly complex task. Consequently there is a large amount of interest in the potential for allowing RL algorithms to adaptively generate (i.e. to learn) approximation architectures. We investigate a method of adapting approximation architectures which uses feedback regarding the frequency with which an agent has visited certain states to guide which areas of the state space to approximate with greater detail. We introduce an algorithm based upon this idea which adapts a state aggregation approximation architecture on-line. Assuming $S$ states, we demonstrate theoretically that - provided the following relatively non-restrictive assumptions are satisfied: (a) the number of cells $X$ in the state aggregation architecture is of order $\sqrt{S}\ln{S}\log_2{S}$ or greater, (b) the policy and transition function are close to deterministic, and (c) the prior for the transition function is uniformly distributed - our algorithm can guarantee, assuming we use an appropriate scoring function to measure VF error, error which is arbitrarily close to zero as $S$ becomes large. It is able to do this despite having only $O(X\log_2{S})$ space complexity (and negligible time complexity). We conclude by generating a set of empirical results which support the theoretical results.
Unsupervised Basis Function Adaptation for Reinforcement Learning
Barker, Edward W., Ras, Charl J.
When using reinforcement learning (RL) algorithms to evaluate a policy it is common, given a large state space, to introduce some form of approximation architecture for the value function (VF). The exact form of this architecture can have a significant effect on the accuracy of the VF estimate, however, and determining a suitable approximation architecture can often be a highly complex task. Consequently there is a large amount of interest in the potential for allowing RL algorithms to adaptively generate approximation architectures. We investigate a method of adapting approximation architectures which uses feedback regarding the frequency with which an agent has visited certain states to guide which areas of the state space to approximate with greater detail. This method is "unsupervised" in the sense that it makes no direct reference to reward or the VF estimate. We introduce an algorithm based upon this idea which adapts a state aggregation approximation architecture on-line. A common method of scoring a VF estimate is to weight the squared Bellman error of each state-action by the probability of that state-action occurring. Adopting this scoring method, and assuming $S$ states, we demonstrate theoretically that - provided (1) the number of cells $X$ in the state aggregation architecture is of order $\sqrt{S}\log_2{S}\ln{S}$ or greater, (2) the policy and transition function are close to deterministic, and (3) the prior for the transition function is uniformly distributed - our algorithm, used in conjunction with a suitable RL algorithm, can guarantee a score which is arbitrarily close to zero as $S$ becomes large. It is able to do this despite having only $O(X \log_2S)$ space complexity and negligible time complexity. The results take advantage of certain properties of the stationary distributions of Markov chains.
Non-parametric Approximate Dynamic Programming via the Kernel Method
Bhat, Nikhil, Farias, Vivek, Moallemi, Ciamac C.
This paper presents a novel non-parametric approximate dynamic programming (ADP) algorithm that enjoys graceful, dimension-independent approximation and sample complexity guarantees. In particular, we establish both theoretically and computationally that our proposal can serve as a viable alternative to state-of-the-art parametric ADP algorithms, freeing the designer from carefully specifying an approximation architecture. We accomplish this by developing a kernel-based mathematical program for ADP. Via a computational study on a controlled queueing network, we show that our non-parametric procedure is competitive with parametric ADP approaches.
Actor-Critic Algorithms
Konda, Vijay R., Tsitsiklis, John N.
We propose and analyze a class of actor-critic algorithms for simulation-based optimization of a Markov decision process over a parameterized family of randomized stationary policies. These are two-time-scale algorithms in which the critic uses TD learning with a linear approximation architecture and the actor is updated in an approximate gradient direction based on information provided by the critic. We show that the features for the critic should span a subspace prescribed by the choice of parameterization of the actor. We conclude by discussing convergence properties and some open problems.
Actor-Critic Algorithms
Konda, Vijay R., Tsitsiklis, John N.
We propose and analyze a class of actor-critic algorithms for simulation-based optimization of a Markov decision process over a parameterized family of randomized stationary policies. These are two-time-scale algorithms in which the critic uses TD learning with a linear approximation architecture and the actor is updated in an approximate gradient direction based on information provided by the critic. We show that the features for the critic should span a subspace prescribed by the choice of parameterization of the actor. We conclude by discussing convergence properties and some open problems.
Actor-Critic Algorithms
Konda, Vijay R., Tsitsiklis, John N.
We propose and analyze a class of actor-critic algorithms for simulation-based optimization of a Markov decision process over a parameterized family of randomized stationary policies. These are two-time-scale algorithms in which the critic uses TD learning with a linear approximation architecture and the actor is updated in an approximate gradient direction based on information provided bythe critic. We show that the features for the critic should span a subspace prescribed by the choice of parameterization of the actor. We conclude by discussing convergence properties and some open problems.
Stable LInear Approximations to Dynamic Programming for Stochastic Control Problems with Local Transitions
Roy, Benjamin Van, Tsitsiklis, John N.
Recently, however, there have been some successful applications of neural networks in a totally different context - that of sequential decision making under uncertainty (stochastic control). Stochastic control problems have been studied extensively in the operations research and control theory literature for a long time, using the methodology of dynamic programming [Bertsekas, 1995]. In dynamic programming, the most important object is the cost-to-go (or value) junction, which evaluates the expected future 1046 B. V. ROY, 1. N. TSITSIKLIS
Stable LInear Approximations to Dynamic Programming for Stochastic Control Problems with Local Transitions
Roy, Benjamin Van, Tsitsiklis, John N.
Recently, however, there have been some successful applications of neural networks in a totally different context - that of sequential decision making under uncertainty (stochastic control). Stochastic control problems have been studied extensively in the operations research and control theory literature for a long time, using the methodology of dynamic programming [Bertsekas, 1995]. In dynamic programming, the most important object is the cost-to-go (or value) junction, which evaluates the expected future 1046 B. V. ROY, 1. N. TSITSIKLIS